Evaluating the Evaluation: A Case Study Using the TREC 2002 Question Answering Track

Author

  • Ellen M. Voorhees
Abstract

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. Yet poorly designed evaluations can waste research effort or even mislead researchers with faulty conclusions. It is therefore important to examine the quality of a new evaluation task to establish its reliability. This paper provides an example of one such assessment by analyzing the task within the TREC 2002 question answering track. The analysis demonstrates that comparative results from the new task are stable, and it empirically estimates the size of the difference required between scores to confidently conclude that two runs are different.

Metric-based evaluations of human language technology such as MUC, TREC, and DUC continue to proliferate (Sparck Jones, 2001). This proliferation is not difficult to understand: evaluations can forge communities, accelerate technology transfer, and advance the state of the art. Yet evaluations are not without their costs. In addition to the financial resources required to support the evaluation, there are also the costs of researcher time and focus. Since a poorly defined evaluation task wastes research effort, it is important to examine the validity of an evaluation task. In this paper, we assess the quality of the new question answering task that was the focus of the TREC 2002 question answering track.

TREC is a workshop series designed to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing results. The conference has focused primarily on the traditional information retrieval problem of retrieving a ranked list of documents in response to a statement of information need, but it also includes other tasks, called tracks, that focus on new areas or particularly difficult aspects of information retrieval. A question answering (QA) track was started in TREC in 1999 (TREC-8) to address the problem of returning answers, rather than document lists, in response to a question.

The task for each of the first three years of the QA track was essentially the same. Participants received a large corpus of newswire documents and a set of factoid questions such as "How many calories are in a Big Mac?" and "Who invented the paper clip?". Systems were required to return a ranked list of up to five [document-id, answer-string] pairs per question such that each answer string was believed to contain an answer to the question. Human assessors read each string and decided whether the string actually did contain an answer to the question. An individual question received a score equal to the reciprocal of the rank at which the first correct response was returned, or zero if none of the five responses contained a correct answer. The score for a submission was then the mean of the individual questions' reciprocal ranks, i.e., the mean reciprocal rank (a small worked sketch follows this passage).

Analysis of the TREC-8 track confirmed the reliability of this evaluation task (Voorhees and Tice, 2000): the assessors understood and could do their assessing job; relative scores between systems were stable despite differences of opinion by assessors; and intuitively better systems received better scores. The task for the TREC 2002 QA track changed significantly from the previous years' task, however, and thus a new assessment of the track is needed.
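As a rough illustration of this scoring, and not the official TREC scoring code, the Python sketch below computes the mean reciprocal rank of a submission from per-question correctness judgments. The function names and the list-of-booleans data layout are assumptions made for the example.

    # Minimal sketch of mean reciprocal rank (MRR) scoring as described above.
    # For each question, `judged_correct` holds booleans marking whether each of
    # the up to five returned [document-id, answer-string] pairs was judged
    # correct, in rank order. (Illustrative layout, not the official TREC tools.)

    def reciprocal_rank(judged_correct):
        # Score one question: 1/rank of the first correct response, 0 if none.
        for rank, correct in enumerate(judged_correct, start=1):
            if correct:
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(per_question_judgments):
        # Score a submission: mean of the individual questions' reciprocal ranks.
        scores = [reciprocal_rank(j) for j in per_question_judgments]
        return sum(scores) / len(scores)

    # Example: correct answers at rank 2, rank 1, and never, respectively.
    # MRR = (0.5 + 1.0 + 0.0) / 3 = 0.5
    judgments = [
        [False, True, False, False, False],
        [True, False, False, False, False],
        [False, False, False, False, False],
    ]
    print(mean_reciprocal_rank(judgments))  # prints 0.5

Under this metric a correct answer at rank 1 is worth twice as much as one at rank 2, so systems are rewarded for placing correct answers early in the list.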
This paper provides that assessment by examining both the ability of the human assessors to make the required judgments and the effect that differences in assessor opinions have on comparative results, and by empirically establishing confidence intervals for the reliability of a comparison as a function of the difference in effectiveness scores. The first section defines the 2002 QA task and provides a brief summary of the system results. The following three sections look at each of the evaluation issues in turn. The final section summarizes the findings.

Related articles

The Evaluation of Question Answering Systems: Lessons Learned from the TREC QA Track

The TREC question answering (QA) track was the first large-scale evaluation of open-domain question answering systems. In addition to successfully fostering research on the QA task, the track has also been used to investigate appropriate evaluation methodologies for question answering systems. This paper gives a brief history of the TREC QA track, motivating the decisions made in its implementa...

Evaluating Question-Answering Techniques in Chinese

An important first step in developing a cross-lingual question answering system is to understand whether techniques developed with English text will also work with other languages, such as Chinese. The Marsha Chinese question answering system described in this paper uses techniques similar to those used in the English systems developed for TREC. Marsha consists of three main components: the que...

A Hybrid Approach to Clinical Question Answering

In this paper, we describe our clinical question answering system developed and submitted for the Text Retrieval Conference (TREC 2014) Clinical Decision Support (CDS) track. The task for this track was to retrieve relevant biomedical articles to answer generic clinical questions about medical case reports. As part of our maiden participation in TREC, we submitted a single run using a hybrid Na...

Question Answering: CNLP at the TREC 2002 Question Answering Track

This paper describes the retrieval experiments for the main task and list task of the TREC-2002 question-answering track. The question answering system described automatically finds answers to questions in a large document collection. The system uses a two-stage retrieval approach to answer finding based on matching of named entities, linguistic patterns, keywords, and the use of a new inferenc...

The TREC-8 Question Answering Track Report

The TREC-8 Question Answering track was the first large-scale evaluation of domain-independent question answering systems. This paper summarizes the results of the track by giving a brief overview of the different approaches taken to solve the problem. The most accurate systems found a correct response for more than 2/3 of the questions. Relatively simple bag-of-words approaches were adequate for ...

Published in: Proceedings of HLT-NAACL 2003 (Main Papers), Edmonton, May-June 2003
Pages: 181-188